{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "ur8xi4C7S06n"
      },
      "outputs": [],
      "source": [
        "# Copyright 2025 Google LLC\n",
        "#\n",
        "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
        "# you may not use this file except in compliance with the License.\n",
        "# You may obtain a copy of the License at\n",
        "#\n",
        "#     https://www.apache.org/licenses/LICENSE-2.0\n",
        "#\n",
        "# Unless required by applicable law or agreed to in writing, software\n",
        "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
        "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
        "# See the License for the specific language governing permissions and\n",
        "# limitations under the License."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "JAPoU8Sm5E6e"
      },
      "source": [
        "# Serving Gemma 3 with vLLM on Cloud Run\n",
        "\n",
        "<table align=\"left\">\n",
        "  <td style=\"text-align: center\">\n",
        "    <a href=\"https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/open-models/serving/cloud_run_vllm_gemma3_inference.ipynb\">\n",
        "      <img width=\"32px\" src=\"https://www.gstatic.com/pantheon/images/bigquery/welcome_page/colab-logo.svg\" alt=\"Google Colaboratory logo\"><br> Open in Colab\n",
        "    </a>\n",
        "  </td>\n",
        "  <td style=\"text-align: center\">\n",
        "    <a href=\"https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fgenerative-ai%2Fmain%2Fopen-models%2Fserving%2Fcloud_run_vllm_gemma3_inference.ipynb\">\n",
        "      <img width=\"32px\" src=\"https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN\" alt=\"Google Cloud Colab Enterprise logo\"><br> Open in Colab Enterprise\n",
        "    </a>\n",
        "  </td>\n",
        "  <td style=\"text-align: center\">\n",
        "    <a href=\"https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/open-models/serving/cloud_run_vllm_gemma3_inference.ipynb\">\n",
        "      <img src=\"https://www.gstatic.com/images/branding/gcpiconscolors/vertexai/v1/32px.svg\" alt=\"Vertex AI logo\"><br> Open in Vertex AI Workbench\n",
        "    </a>\n",
        "  </td>\n",
        "  <td style=\"text-align: center\">\n",
        "    <a href=\"https://github.com/GoogleCloudPlatform/generative-ai/blob/main/open-models/serving/cloud_run_vllm_gemma3_inference.ipynb\">\n",
        "      <img width=\"32px\" src=\"https://raw.githubusercontent.com/primer/octicons/refs/heads/main/icons/mark-github-24.svg\" alt=\"GitHub logo\"><br> View on GitHub\n",
        "    </a>\n",
        "  </td>\n",
        "</table>\n",
        "\n",
        "<div style=\"clear: both;\"></div>\n",
        "<b>Share to:</b>\n",
        "\n",
        "<a href=\"https://www.linkedin.com/sharing/share-offsite/?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/open-models/serving/cloud_run_vllm_gemma3_inference.ipynb\" target=\"_blank\">\n",
        "  <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/8/81/LinkedIn_icon.svg\" alt=\"LinkedIn logo\">\n",
        "</a>\n",
        "\n",
        "<a href=\"https://bsky.app/intent/compose?text=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/open-models/serving/cloud_run_vllm_gemma3_inference.ipynb\" target=\"_blank\">\n",
        "  <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/7/7a/Bluesky_Logo.svg\" alt=\"Bluesky logo\">\n",
        "</a>\n",
        "\n",
        "<a href=\"https://twitter.com/intent/tweet?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/open-models/serving/cloud_run_vllm_gemma3_inference.ipynb\" target=\"_blank\">\n",
        "  <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/5/5a/X_icon_2.svg\" alt=\"X logo\">\n",
        "</a>\n",
        "\n",
        "<a href=\"https://reddit.com/submit?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/open-models/serving/cloud_run_vllm_gemma3_inference.ipynb\" target=\"_blank\">\n",
        "  <img width=\"20px\" src=\"https://redditinc.com/hubfs/Reddit%20Inc/Brand/Reddit_Logo.png\" alt=\"Reddit logo\">\n",
        "</a>\n",
        "\n",
        "<a href=\"https://www.facebook.com/sharer/sharer.php?u=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/open-models/serving/cloud_run_vllm_gemma3_inference.ipynb\" target=\"_blank\">\n",
        "  <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/5/51/Facebook_f_logo_%282019%29.svg\" alt=\"Facebook logo\">\n",
        "</a>            "
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "83b98b0ba19c"
      },
      "source": [
        "<img src=\"https://docs.vllm.ai/en/latest/_images/vllm-logo-text-light.png\" width=\"200px\" alignment=\"center\"/>\n",
        "<img src=\"https://cloud.google.com/static/architecture/images/ac-page-icons/card_google_cloud_partner.svg\" width=\"100px\">\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "84f0f73a0f76"
      },
      "source": [
        "| | |\n",
        "|-|-|\n",
        "| Author(s) | [Vlad Kolesnikov](https://github.com/vladkol) |"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ccd500ae19b5"
      },
      "source": [
        "## Overview"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "tvgnzT1CKxrO"
      },
      "source": [
        "> [**Gemma 3**](https://ai.google.dev/gemma) is a new generation of open models developed by Google. It is a collection of lightweight, state-of-the-art open models built from the same research and technology that powers our Gemini 2.0 models. Gemma 3 comes in a range of sizes (1B, 4B, 12B and 27B), allowing you to choose the best model for your specific hardware and performance needs. Gemma 3 models are available through platforms like Google AI Studio, Kaggle, and Hugging Face.\n",
        "\n",
        "> **[Cloud Run](https://cloud.google.com/run)**:\n",
        "It's a serverless platform by Google Cloud for running containerized applications. It automatically scales and manages infrastructure, supporting various programming languages. Cloud Run now offers GPU acceleration for AI/ML workloads. With 30 seconds to the first token, Cloud Run is a perfect platform for serving lightweight models like Gemma.\n",
        "\n",
        "> **Note:** GPU support in Cloud Run is in preview. To use the GPU feature, you must request `Total Nvidia L4 GPU allocation, per project per region` quota under Cloud Run in the [Quotas and system limits page](https://cloud.google.com/run/quotas#increase).\n",
        "\n",
        "\n",
        "> **[vLLM](https://docs.vllm.ai/)**: is a fast and easy-to-use library for LLM inference and serving. It provides high-throughput serving, optimized model execution, and tensor parallelism.\n",
        "\n",
        "This notebook showcase how to deploy [Google Gemma 3](https://developers.googleblog.com/en/introducing-gemma3) in Cloud Run, with the objective to build a simple API for chat or RAG applications.\n",
        "\n",
        "By the end of this notebook, you will learn how to:\n",
        "\n",
        "1. Deploy Google Gemma 3 as an OpenAI-compatible API on Cloud Run using vLLM.\n",
        "2. Build a custom container with vLLM to deploy any Large Language Model (LLM) of your choice.\n",
        "3. Make requests to an API hosted on Cloud Run."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "61RBz8LLbxCR"
      },
      "source": [
        "## Get started"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "No17Cw5hgx12"
      },
      "source": [
        "### Install Google Cloud SDK and HuggingFace Hub dependencies\n",
        "\n",
        "Make sure you Google Cloud SDK is installed (try running `gcloud version`) or [install it](https://cloud.google.com/sdk/docs/install) before executing this notebook.\n",
        "\n",
        "> If you are running in Colab or Vertex AI workbench, you have Google Cloud SDK installed.\n",
        "\n",
        "You also need to install `huggingface_hub` Python package."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Tm31a7_cDbeu"
      },
      "outputs": [],
      "source": [
        "%pip install --upgrade --quiet huggingface_hub"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "MajzyxmFDbeu"
      },
      "source": [
        "### Choose a model, a project, and a region to host the model\n",
        "\n",
        "Choose a Gemma 3 model to use, a Google Cloud project to host your Cloud Run service, and a region to host it in.\n",
        "\n",
        "If you don't have a project yet:\n",
        "\n",
        "1. [Create a project](https://console.cloud.google.com/projectcreate) in the Google Cloud Console.\n",
        "2. Copy your `Project ID` from the project's [Settings page](https://console.cloud.google.com/iam-admin/settings).\n",
        "\n",
        "The project must have `Total Nvidia L4 GPU allocation, per project per region` quota allocated in the selected region.\n",
        "To make sure it's available, check Cloud Run in the [Quotas and system limits page](https://console.cloud.google.com/iam-admin/quotas)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "cellView": "form",
        "id": "Do_Z5bVNDbeu"
      },
      "outputs": [],
      "source": [
        "# { display-mode: \"form\", run: \"auto\" }\n",
        "\n",
        "MODEL = \"google/gemma-3-4b-it\"  # @param {type:\"string\"}\n",
        "\n",
        "PROJECT_ID = \"[your-project-id]\"  # @param {type:\"string\", isTemplate: true}\n",
        "REGION = \"us-central1\"  # @param {type:\"string\", isTemplate: true}\n",
        "\n",
        "if PROJECT_ID == \"[your-project-id]\" or not PROJECT_ID:\n",
        "    print(\"Please specify your project id in PROJECT_ID variable.\")\n",
        "    raise KeyboardInterrupt\n",
        "\n",
        "SERVICE_NAME = f\"vllm--{MODEL.replace('.', '-').replace('/','--')}\""
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "dmWOrTJ3gx13"
      },
      "source": [
        "### Authenticate your Google Cloud account\n",
        "\n",
        "Depending on your Jupyter environment, you may have to manually authenticate. Run the cell below."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "c93d96bc9c43"
      },
      "outputs": [],
      "source": [
        "!gcloud auth print-identity-token -q &> /dev/null || gcloud auth login --project=\"{PROJECT_ID}\" --update-adc --quiet"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "f9af3e57f89a"
      },
      "source": [
        "### Authenticate your Hugging Face account\n",
        "\n",
        "As Google Gemma (https://huggingface.co/google/gemma-3-4b-it) are gated models, you need to have a Hugging Face Hub account, and accept the Google's usage license for Gemma. Once that's done, you need to generate a new user access token with read-only access so that the weights can be downloaded from the Hub.\n",
        "\n",
        "> Note that the user access token can only be generated via [the Hugging Face Hub UI](https://huggingface.co/settings/tokens/new), where you can either select read-only access to your account, or follow the recommendations and generate a fine-grained token with read-only access to [`google/gemma-3-4b-it`](https://huggingface.co/google/gemma-3-4b-it).\n",
        ">\n",
        "> Even you had an access token before accepting the Gemma license, you need to generate a new one after accepting it.\n",
        "\n",
        "Then you can use `huggingface_hub` for the authentication with the token generated in advance. So that then the token can be safely retrieved via `huggingface_hub.get_token`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "8d836e0210fe"
      },
      "outputs": [],
      "source": [
        "from huggingface_hub import get_token, interpreter_login\n",
        "\n",
        "interpreter_login(new_session=False)\n",
        "HF_TOKEN = get_token()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "c71a4314c250"
      },
      "source": [
        "Read more about [Hugging Face Security](https://huggingface.co/docs/hub/en/security), specifically about [Hugging Face User Access Tokens](https://huggingface.co/docs/hub/en/security-tokens)."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "gwSj4-j9Dbev"
      },
      "source": [
        "## Prepare container image\n",
        "\n",
        "First, let's create a Docker file for a container with the model embedded into it.\n",
        "\n",
        "It will use your Hugging Face Hub secret at build time."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "8Pt7jloHDbev"
      },
      "outputs": [],
      "source": [
        "%%writefile Dockerfile\n",
        "\n",
        "FROM us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20250312_0916_RC01\n",
        "\n",
        "ARG MODEL\n",
        "ENV HF_MODEL=$MODEL\n",
        "ENV MODEL_ID=$MODEL\n",
        "\n",
        "ENV HF_HOME=/model-cache\n",
        "# Download the model to the cache\n",
        "RUN --mount=type=secret,id=HF_TOKEN HF_TOKEN=$(cat /run/secrets/HF_TOKEN) \\\n",
        "    huggingface-cli download ${HF_MODEL}\n",
        "# Switch to offline mode to make sure vLLM always uses the cached model\n",
        "ENV HF_HUB_OFFLINE=1\n",
        "\n",
        "ENTRYPOINT python3 -m vllm.entrypoints.openai.api_server \\\n",
        "    --port ${PORT:-8080} \\\n",
        "    --model ${HF_MODEL} \\\n",
        "    --max-num-seqs=4 \\\n",
        "    --max-model-len 32768"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "vZG5JfZRDbev"
      },
      "source": [
        "Second, we create a Cloud Build file to use for building and pushing our container image."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "cRiA0VArDbev"
      },
      "outputs": [],
      "source": [
        "%%writefile cloudbuild.yaml\n",
        "\n",
        "steps:\n",
        "- name: 'gcr.io/cloud-builders/docker'\n",
        "  id: build\n",
        "  entrypoint: 'bash'\n",
        "  secretEnv: ['HF_TOKEN']\n",
        "  args:\n",
        "    - -c\n",
        "    - |\n",
        "        SECRET_TOKEN=\"$$HF_TOKEN\" docker buildx build --tag=${_IMAGE} --build-arg MODEL=${_MODEL} --secret id=HF_TOKEN .\n",
        "\n",
        "availableSecrets:\n",
        "  secretManager:\n",
        "  - versionName: 'projects/${PROJECT_ID}/secrets/${_HF_TOKEN_SECRET_NAME}/versions/latest'\n",
        "    env: 'HF_TOKEN'\n",
        "\n",
        "images: [\"${_IMAGE}\"]\n",
        "\n",
        "substitutions:\n",
        "  _IMAGE: '${_REGION}-docker.pkg.dev/${PROJECT_ID}/${_AR_REPO_NAME}/${_SERVICE_NAME}'\n",
        "\n",
        "options:\n",
        "  dynamicSubstitutions: true\n",
        "  machineType: \"E2_HIGHCPU_32\"\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "9Vppnp8kDbev"
      },
      "source": [
        "## Build Container Image and Deploy Cloud Run Service\n",
        "\n",
        "We are ready to build our container image and deploy Cloud Run service.\n",
        "\n",
        "The script below performs the following actions:\n",
        "\n",
        "* Enables necessary APIs.\n",
        "* Creates an Artifact Repository for the image.\n",
        "* Creates a Service Account for the service.\n",
        "* Create a secret in Secret Manager that Cloud Build will use for building the container image.\n",
        "* Assigns Service Account permissions to access the secret.\n",
        "* Submits a Cloud Build job to create and push the container image.\n",
        "* Deploys the Cloud Run service.\n",
        "\n",
        "> The script may take 15-45 minutes to finish.\n",
        "\n",
        "Note the following important flags in Cloud Build deployment command:\n",
        "\n",
        "* `--gpu 1` with `--gpu-type nvidia-l4` assigns 1 NVIDIA L4 GPU to every Cloud Run instance in the service.\n",
        "`--no-allow-authenticated` restricts unauthenticated access to the service.\n",
        "By keeping the service private, you can rely on Cloud Run's built-in [Identity and Access Management (IAM)](https://cloud.google.com/iam) authentication for service-to-service communication.\n",
        "* `--no-cpu-throttling` is required for enabling GPU.\n",
        "* `--service-account` the service identity of the service.\n",
        "* `--max-instances` sets maximum number of instances of the service.\n",
        "It has to be equal to or lower than your project's NVIDIA L4 GPU (`Total Nvidia L4 GPU allocation, per project per region`) quota."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "xsbvKy-_Dbev"
      },
      "outputs": [],
      "source": [
        "%%writefile deploy.sh\n",
        "\n",
        "PROJECT_ID=$1\n",
        "REGION=$2\n",
        "MODEL_ID=\"${3}\"\n",
        "SERVICE_NAME=\"${5}\"\n",
        "AR_REPO_NAME=\"vllm-repo\"\n",
        "SERVICE_ACCOUNT=\"vllm-cloud-run-sa\"\n",
        "SERVICE_ACCOUNT_ADDRESS=\"${SERVICE_ACCOUNT}@$PROJECT_ID.iam.gserviceaccount.com\"\n",
        "HF_TOKEN_SECRET_NAME=\"HF_TOKEN_SECRET\"\n",
        "HF_TOKEN=$4\n",
        "MAX_INSTANCES=1 # Adjust this value to match your Cloud Run L4 GPU quota (\"Total Nvidia L4 GPU allocation, per project per region\", NvidiaL4GpuAllocPerProjectRegion, run.googleapis.com/nvidia_l4_gpu_allocation)\n",
        "\n",
        "echo \"Enabling APIs in project ${PROJECT_ID}.\"\n",
        "gcloud services enable run.googleapis.com \\\n",
        "    cloudbuild.googleapis.com \\\n",
        "    secretmanager.googleapis.com \\\n",
        "    artifactregistry.googleapis.com \\\n",
        "    --project ${PROJECT_ID} \\\n",
        "    --quiet\n",
        "\n",
        "set -e\n",
        "\n",
        "# Creating the service account if doesn't exist.\n",
        "sa_list=$(gcloud iam service-accounts list --quiet --format 'value(email)' --project $PROJECT_ID --filter=email:$SERVICE_ACCOUNT@$PROJECT_ID.iam.gserviceaccount.com 2>/dev/null)\n",
        "if [ -z \"${sa_list}\" ]; then\n",
        "    echo \"Creating Service Account ${SERVICE_ACCOUNT}.\"\n",
        "    gcloud iam service-accounts create $SERVICE_ACCOUNT \\\n",
        "        --project ${PROJECT_ID} \\\n",
        "        --display-name=\"${SERVICE_ACCOUNT} - Cloud Run Service Account\"\n",
        "fi\n",
        "\n",
        "# Creating or updating the secret.\n",
        "secrets_list=$(gcloud secrets list --project=\"${PROJECT_ID}\"  --quiet --format 'value(name)' --filter=name:\"${HF_TOKEN_SECRET_NAME}\" 2>/dev/null)\n",
        "if [ -z \"${secrets_list}\" ]; then\n",
        "    echo \"Creating Secret ${HF_TOKEN_SECRET_NAME}.\"\n",
        "    echo \"${HF_TOKEN}\" | gcloud secrets create \"${HF_TOKEN_SECRET_NAME}\" --project=${PROJECT_ID} --data-file=-\n",
        "else\n",
        "    echo \"Updating Secret ${HF_TOKEN_SECRET_NAME}.\"\n",
        "    echo \"${HF_TOKEN}\" | gcloud secrets versions add \"${HF_TOKEN_SECRET_NAME}\" --project=${PROJECT_ID} --data-file=-\n",
        "fi\n",
        "\n",
        "# Getting Cloud Build service account and giving it access to the secret\n",
        "echo \"Applying Cloud Build account permissions.\"\n",
        "cloud_build_sa=$(gcloud builds get-default-service-account --format 'value(serviceAccountEmail)' --project ${PROJECT_ID} --quiet)\n",
        "cloud_build_sa_email=\"${cloud_build_sa##*/}\"\n",
        "gcloud secrets add-iam-policy-binding \"${HF_TOKEN_SECRET_NAME}\" \\\n",
        "    --member serviceAccount:\"${cloud_build_sa_email}\" \\\n",
        "    --role='roles/secretmanager.secretAccessor' \\\n",
        "    --project=${PROJECT_ID}\n",
        "\n",
        "# Creating the Artifacts Repository if doesn't exist\n",
        "repo_list=$(gcloud artifacts repositories list --format 'value(name)' --filter=name=\"projects/${PROJECT_ID}/locations/${REGION}/repositories/${AR_REPO_NAME}\" --project ${PROJECT_ID} --quiet --location ${REGION} 2>/dev/null)\n",
        "if [ -z \"${repo_list}\" ]; then\n",
        "    echo \"Creating Artifact Registry ${AR_REPO_NAME}.\"\n",
        "    gcloud artifacts repositories create $AR_REPO_NAME \\\n",
        "    --repository-format docker \\\n",
        "    --location ${REGION} \\\n",
        "    --project=${PROJECT_ID}\n",
        "fi\n",
        "\n",
        "echo \"Building container image.\"\n",
        "gcloud builds submit --config=cloudbuild.yaml --project=${PROJECT_ID} . \\\n",
        "    --suppress-logs \\\n",
        "    --substitutions \\\n",
        "  _AR_REPO_NAME=$AR_REPO_NAME,_REGION=$REGION,_SERVICE_NAME=$SERVICE_NAME,_MODEL=$MODEL_ID,_HF_TOKEN_SECRET_NAME=$HF_TOKEN_SECRET_NAME\n",
        "rm -f cloudbuild.yaml\n",
        "rm -f Dockerfile\n",
        "\n",
        "echo \"Deploying Service ${SERVICE_NAME}.\"\n",
        "gcloud beta run deploy $SERVICE_NAME \\\n",
        "    --project=${PROJECT_ID} \\\n",
        "    --image=${REGION}-docker.pkg.dev/$PROJECT_ID/$AR_REPO_NAME/$SERVICE_NAME \\\n",
        "    --service-account $SERVICE_ACCOUNT_ADDRESS \\\n",
        "    --cpu=8 \\\n",
        "    --memory=32Gi \\\n",
        "    --gpu=1 --gpu-type=nvidia-l4 \\\n",
        "    --region ${REGION} \\\n",
        "    --no-allow-unauthenticated \\\n",
        "    --max-instances ${MAX_INSTANCES} \\\n",
        "    --no-cpu-throttling \\\n",
        "    --timeout 1h\n",
        "\n",
        "SERVICE_URL=$(gcloud run services describe ${SERVICE_NAME} --project=${PROJECT_ID} --region $REGION --format 'value(status.url)' --quiet)\n",
        "echo \"✅ Success!\"\n",
        "echo \"🚀 Service URL: ${SERVICE_URL}\""
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "eff7e1585597"
      },
      "outputs": [],
      "source": [
        "!/bin/bash ./deploy.sh \"{PROJECT_ID}\" \"{REGION}\" \"{MODEL}\" \"{HF_TOKEN}\" \"{SERVICE_NAME}\" && rm -f ./deploy.sh"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "WzSFgfuyDbew"
      },
      "source": [
        "## Test the deployed service\n",
        "\n",
        "Now, let's test the service you deployed.\n",
        "\n",
        "First, simply by using `cURL`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "1LDyxRmADbew"
      },
      "outputs": [],
      "source": [
        "%%bash -s {MODEL} {SERVICE_NAME} {PROJECT_ID} {REGION}\n",
        "\n",
        "PROMPT=\"Why is the sky blue?\"\n",
        "SERVICE_URL=$(gcloud run services describe ${2} --project ${3} --region ${4} --format 'value(status.url)' --quiet)\n",
        "AUTH_TOKEN=$(gcloud auth print-identity-token -q)\n",
        "\n",
        "curl \"${SERVICE_URL}/v1/chat/completions\" \\\n",
        "    -s \\\n",
        "    -X POST \\\n",
        "    -H \"Content-Type: application/json\" \\\n",
        "    -H \"Authorization: Bearer ${AUTH_TOKEN}\" \\\n",
        "    -d @- <<EOF | python3 -m json.tool\n",
        "    {\n",
        "        \"model\": \"$1\",\n",
        "        \"messages\": [\n",
        "            {\n",
        "            \"role\": \"user\",\n",
        "            \"content\": \"${PROMPT}\"\n",
        "            }\n",
        "        ]\n",
        "    }\n",
        "EOF"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "bc1P8IcrDbew"
      },
      "source": [
        "### OpenAI Python API library\n",
        "\n",
        "vLLM serving container provides [OpenAI-compatible endpoint](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html) you've just used. Let's test it with OpenAI Python API library."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "L5bVcdITDbew"
      },
      "outputs": [],
      "source": [
        "# Install OpenAI library\n",
        "%pip install openai -q"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "1xNVHL-6Dbew"
      },
      "outputs": [],
      "source": [
        "import subprocess\n",
        "\n",
        "from IPython.display import Image, Markdown, display\n",
        "from openai import OpenAI\n",
        "\n",
        "prompt = \"What's in this image?\"\n",
        "image_url = \"https://upload.wikimedia.org/wikipedia/commons/thumb/b/b6/Felis_catus-cat_on_snow.jpg/320px-Felis_catus-cat_on_snow.jpg\"\n",
        "\n",
        "identity_token = (\n",
        "    subprocess.check_output(\"gcloud auth print-identity-token -q\", shell=True)\n",
        "    .decode()\n",
        "    .strip()\n",
        ")\n",
        "service_url = (\n",
        "    subprocess.check_output(\n",
        "        (\n",
        "            \"gcloud run services describe \"\n",
        "            f\"{SERVICE_NAME} --project={PROJECT_ID} \"\n",
        "            f\"--region={REGION} \"\n",
        "            \"--format='value(status.url)' -q\"\n",
        "        ),\n",
        "        shell=True,\n",
        "    )\n",
        "    .decode()\n",
        "    .strip()\n",
        ")\n",
        "\n",
        "client = OpenAI(\n",
        "    api_key=\"EMPTY\",\n",
        "    base_url=f\"{service_url}/v1\",\n",
        ")\n",
        "chat_response = client.chat.completions.create(\n",
        "    model=MODEL,\n",
        "    messages=[\n",
        "        {\n",
        "            \"role\": \"user\",\n",
        "            \"content\": [\n",
        "                {\"type\": \"text\", \"text\": prompt},\n",
        "                {\n",
        "                    \"type\": \"image_url\",\n",
        "                    \"image_url\": {\n",
        "                        \"url\": image_url,\n",
        "                    },\n",
        "                },\n",
        "            ],\n",
        "        }\n",
        "    ],\n",
        "    temperature=0.5,\n",
        "    extra_headers=f\"Bearer {identity_token}\",\n",
        ")\n",
        "print(f\"Prompt: {prompt}\")\n",
        "display(Image(url=image_url))\n",
        "Markdown(chat_response.choices[0].message.content)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "lLLwC3E3Dbew"
      },
      "source": [
        "## Conclusion\n",
        "Congratulations! Now you know how to deploy Gemma 3 with vLLM to Cloud Run powered by a GPU!"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "f6f17f9aff65"
      },
      "source": [
        "## Cleaning up"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "TKNPU6JADbew"
      },
      "source": [
        "To delete the Cloud Run service you created, you can uncomment and run the following cell."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "rxfg1S7pDbew"
      },
      "outputs": [],
      "source": [
        "# !gcloud run services delete $SERVICE_NAME --project $PROJECT_ID --region $LOCATION --quiet"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "name": "cloud_run_vllm_gemma3_inference.ipynb",
      "toc_visible": true
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
